Abstract: The invention relates to a method of synthesizing of a speech signal, the speech signal having at least a first speech
unit and a second speech unit, the method comprising the steps of: providing a first speech unit signal, the first speech unit signal
having an end interval, providing a second speech unit signal, the second speech unit signal having a front interval, appending of at
least some of the periods of the end interval in inverted order at the end of the first speech unit signal to provide a fade-out interval,
appending of at least some of the periods of the front interval in inverted order at the beginning of the second speech unit signal to
provide a fade-in interval, superposing of the end and fade-in intervals and of the fade-out and front intervals.
SPEECH SYNTHESIS USING CONCATENATION OF SPEECH WAVEFORMS

The present invention relates to the field of synthesizing of speech or music, and
more particularly, without limitation, to the field of text-to-speech synthesis.

The function of a text-to-speech (TTS) synthesis system is to synthesize
speech from a generic text in a given language. Nowadays, TTS systems have been put into
practical operation for many applications, such as access to databases through the telephone
network or aid to handicapped people. One method to synthesize speech is by concatenating
elements of a recorded set of subunits of speech such as demi-syllables or polyphones. The
majority of successful commercial systems employ the concatenation of polyphones.



The polyphones comprise groups of two (diphones), three (triphones) or more
phones and may be determined from nonsense words, by segmenting the desired grouping of
phones at stable spectral regions. In a concatenation based synthesis, the conservation of the
transition between two adjacent phones is crucial to assure the quality of the synthesized
speech. With the choice of polyphones as the basic subunits, the transition between two
adjacent phones is preserved in the recorded subunits, and the concatenation is carried out
between similar phones.

Before the synthesis, however, the phones must have their duration and pitch
modified in order to fulfil the prosodic constraints of the new words containing those phones.
This processing is necessary to avoid the production of a monotonous sounding synthesized
speech. In a TTS system, this function is performed by a prosodic module. To allow the
duration and pitch modifications in the recorded subunits, many concatenation based TTS
systems employ the time-domain pitch-synchronous overlap-add (TD-PSOLA) (E. Moulines
and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech
synthesis using diphones," Speech Commun., vol. 9, pp. 453-467, 1990) model of synthesis.

In the TD-PSOLA model, the speech signal is first submitted to a pitch
marking algorithm. This algorithm assigns marks at the peaks of the signal in the voiced
segments and assigns marks 10 ms apart in the unvoiced segments. The synthesis is made by
a superposition of Hanning windowed segments centered at the pitch marks and extending
from the previous pitch mark to the next one. The duration modification is provided by
deleting or replicating some of the windowed segments. The pitch period modification, on
the other hand, is provided by increasing or decreasing the superposition between windowed
segments.
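The TD-PSOLA operations described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the published algorithm: pitch marks are taken as given, and the names `hann`, `td_psola`, `order` and `pitch_scale` are hypothetical. Repeating or omitting an index in `order` replicates or deletes a windowed segment (duration change), while `pitch_scale` changes the spacing, and hence the overlap, between consecutive segments (pitch change).

```python
import math


def hann(length):
    """Symmetric Hann (Hanning) window of the given length."""
    if length < 2:
        return [1.0] * length
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def td_psola(signal, marks, order, pitch_scale=1.0):
    """Overlap-add Hann-windowed segments centred at pitch marks.

    signal      -- list of samples
    marks       -- increasing sample indices of the pitch marks
    order       -- indices k of the marks whose segments are emitted, with
                   1 <= k <= len(marks) - 2 so both neighbours exist
    pitch_scale -- hypothetical factor; values above 1 place the segments
                   closer together (more overlap, shorter pitch period)
    """
    pieces = []   # (start position in the output, windowed segment)
    pos = None    # output position of the current pitch mark
    for k in order:
        left, centre, right = marks[k - 1], marks[k], marks[k + 1]
        # Segment spans from the previous pitch mark to the next one.
        window = hann(right - left)
        segment = [s * w for s, w in zip(signal[left:right], window)]
        if pos is None:
            pos = centre - left
        pieces.append((pos - (centre - left), segment))
        # Advance by the (scaled) distance to the next pitch mark.
        pos += max(1, round((right - centre) / pitch_scale))
    shift = -min(start for start, _ in pieces)   # keep all starts >= 0
    out = [0.0] * max(start + shift + len(seg) for start, seg in pieces)
    for start, seg in pieces:
        for i, value in enumerate(seg):
            out[start + shift + i] += value
    return out
```

For example, with pitch marks every 50 samples, `order=[1, 1, 2, 2, 3, 3]` replicates each segment once and lengthens the output, whereas `pitch_scale=2.0` halves the spacing between the emitted segments.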

Despite the success achieved in many commercial TTS systems, the synthetic
speech produced by using the TD-PSOLA model of synthesis can present some drawbacks,
mainly under large prosodic variations.

Examples of such PSOLA methods are those defined in documents EP-0363233,
U.S. Pat. No. 5,479,564 and EP-0706170. A specific example is also the MBR-PSOLA
method as published by T. Dutoit and H. Leich, in Speech Communication, Elsevier
Publisher, November 1993, vol. 13, No. 3-4. The method described in document
U.S. Pat. No. 5,479,564 suggests a means of modifying the frequency of an audio signal by
overlap-adding short-term signals extracted from that signal. The length of the weighting
windows used to obtain the short-term signals is approximately equal to two times the period
of the audio signal and their position within the period can be set to any value (provided the
time shift between successive windows is equal to the period of the audio signal). Document
U.S. Pat. No. 5,479,564 also describes a means of interpolating waveforms between segments to
concatenate, so as to smooth out discontinuities.

In prior art text-to-speech systems a set of
pre-recorded speech fragments can be concatenated in a specific order to convert a certain
text into natural sounding speech. Text-to-speech systems that use small speech fragments
have many such concatenation points. Especially when the speech fragments are spectrally
different, these joins produce artefacts that reduce the intelligibility. In particular, when two
speech segments from different recording times are to be concatenated, the resulting speech
can have a discontinuity at the joint of the two segments. For example, when a vowel is
synthesized, the left part mostly comes from a different recording than the right part. This
makes it impossible to reproduce the exact color of a vowel.

The slight differences in the formant trajectories produce a sudden jump at the
joint location. What is mostly done in the prior art to reduce this effect is to re-record the
speech fragment until it matches with the rest or add different versions (extra fragments) to
minimize the difference.

The present invention therefore aims to provide an improved method of
synthesizing of a speech signal, the speech signal having at least a first diphone and a second
diphone. The present invention further aims to provide a corresponding computer program
product and computer system, in particular text-to-speech system.

WO 2004/027756 PCT/IB2003/003624

The present invention provides for a method of synthesizing of a speech signal
based on first and second diphone signals which are superposed at their joint. The invention
enables a smooth concatenation of the diphone signals without any audible artefacts. This is
accomplished by appending periods of an end interval of the first diphone signal in inverted
order at the end of the first diphone signal and by appending periods of a front interval of the
second diphone signal in inverted order at the beginning of the second diphone signal. The
end and fade-in intervals and the fade-out and front intervals are overlapped to produce the
smooth transition.

In accordance with an embodiment of the invention the end and front intervals
of the first and second diphone signal are identified by a marker. Preferably the end and front
intervals contain periods which are about steady, i.e. which have approximately the same
information content and signal form. Such end and front intervals can be identified by a
human expert or by means of a corresponding computer program. Preferably the first analysis
is performed by means of a computer program and the result is reviewed by a human expert
for increased precision.

In accordance with a further embodiment of the invention the last period of the
end interval and the first period of the front interval are not appended. This has the advantage
that no periodicity is introduced into the signal by the immediate repetition of two identical
periods.

In accordance with a further embodiment of the invention a windowing
operation is performed on the end and front intervals as well as on the respective appended
periods by means of fade-out and fade-in windows, respectively. Preferably a raised cosine
window function is used for voiced end intervals and the appended periods, whereas for
unvoiced end intervals and the appended periods a sine window is used as a fade-out
window. Likewise a raised cosine is used as a window function for smoothening the
beginning of a voiced segment of the second diphone or a sine window for unvoiced
segments.
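As a rough illustration of the two window types, the sketch below takes the raised cosine as w[n] = 0.5 - 0.5·cos(π(n + 0.5)/m) and the sine window as w[n] = sin(π(n + 0.5)/(2m)) for 0 ≤ n < m, following the formulas of the detailed description as read here. The function names are hypothetical, and treating the fade-out window as the time-reversed fade-in window is an assumption made for this example.

```python
import math


def raised_cosine_window(m):
    """Fade-in weights for voiced intervals, one weight per index n in 0..m-1."""
    return [0.5 - 0.5 * math.cos(math.pi * (n + 0.5) / m) for n in range(m)]


def sine_window(m):
    """Fade-in weights for unvoiced intervals, one weight per index n in 0..m-1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * m)) for n in range(m)]


def fade_out(fade_in_window):
    """Assumed fade-out window: the fade-in window in reverse order."""
    return fade_in_window[::-1]


# Both windows rise from near 0 to near 1 over the smoothing range.
voiced_in = raised_cosine_window(8)
unvoiced_in = sine_window(8)
voiced_out = fade_out(voiced_in)
unvoiced_out = fade_out(unvoiced_in)
```

Under these assumptions the voiced fade-in and fade-out weights sum to one at every index, while the squared unvoiced weights sum to one.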

In accordance with an embodiment of the invention a duration adaptation is 
performed for the intervals to be overlapped. Especially if the intervals have different 
durations this is advantageous in order to avoid the introduction of abrupt signal transitions. 
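The text does not say how the durations are adapted. As one possible illustration only, the sketch below stretches or compresses an interval to a target number of samples by linear interpolation; the function name `rescale_interval` and the choice of linear interpolation are assumptions introduced here.

```python
def rescale_interval(samples, target_length):
    """Resample a list of samples to target_length points.

    Linear interpolation between neighbouring samples; the first and last
    samples are preserved. This is an assumed form of duration adaptation.
    """
    if target_length <= 0 or not samples:
        return []
    if len(samples) == 1 or target_length == 1:
        return [float(samples[0])] * target_length
    step = (len(samples) - 1) / (target_length - 1)
    result = []
    for i in range(target_length):
        x = i * step                          # fractional source position
        lo = min(int(x), len(samples) - 2)    # left neighbour index
        frac = x - lo
        result.append((1.0 - frac) * samples[lo] + frac * samples[lo + 1])
    return result


# Bring two intervals of different duration to a common length.
end_interval = [0.0, 1.0, 2.0]
fade_in_interval = [0.0, 0.5, 1.0, 1.5, 2.0]
adapted = rescale_interval(end_interval, len(fade_in_interval))
```

Plain resampling of this kind also changes the pitch of the rescaled interval, so a period-based adaptation (deleting or replicating periods) may be the more natural choice in a pitch-synchronous system.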

In accordance with a further embodiment of the invention, text-to-speech
processing is performed by concatenating diphones in accordance with the principles of the
present invention. This way a natural sounding speech output can be produced.

It is important to note that the present invention is not restricted to the 
concatenation of diphones but can also be advantageously employed for the concatenation of 
other speech units such as triphones, polyphones or words. 
In the following, embodiments of the invention are described in greater detail
by making reference to the drawings in which:

Fig. 1 depicts a flow chart of a preferred embodiment of a method of the
invention,

Fig. 2 depicts the interleaved repetition of periods at the end and the front of
the original diphone signals,

Fig. 3 depicts an example for a signal synthesis, and

Fig. 4 depicts a block diagram of an embodiment of a text-to-speech system.



Fig. 1 shows a flow diagram which illustrates a preferred embodiment of a
method of the present invention. In step 100 a first diphone signal A is provided. The diphone
signal A has at least one marker which identifies an end interval of the diphone signal A.

In step 102 periods within the end interval of the diphone signal A are
repeated in inverted order in order to provide a fade-out interval which is appended at the end
of the end interval. In step 104 the end interval with its appended fade-out interval is
windowed by means of a fade-out window function in order to smoothly fade out the diphone
signal at its end. Likewise a diphone signal B is provided in step 106. The diphone signal B
has at least one associated marker in order to identify a front interval of the diphone signal
B. In step 108 at least some of the front interval's periods are appended at the beginning of the
front interval of the diphone signal B in inverted order. This way a fade-in interval is
provided. In step 110 the front interval and the appended fade-in interval are windowed by
means of a fade-in window. This way a smooth beginning of the diphone signal B is
provided. In step 112 a duration adaptation is performed. This means that the durations of the
end and front intervals of the diphone signals A and B are modified such that the end and
fade-in intervals have the same duration. Likewise the durations of the fade-out and front
intervals are adapted. In step 114 an overlap and add operation is performed on the diphone
signals A and B with the processed end and fade-in intervals and the fade-out and front
intervals. This way a smooth concatenation of the diphone signals A and B is accomplished.
For voiced segments usage of the following raised cosine window function is preferred:

$$w[n] = 0.5 - 0.5 \cos\left(\frac{\pi (n + 0.5)}{m}\right), \qquad 0 \le n < m$$

where m is the total number of periods in the smoothing range.
For unvoiced segments, a sine window is used:

$$w[n] = \sin\left(\frac{\pi (n + 0.5)}{2m}\right), \qquad 0 \le n < m$$

The advantage of using a sine-window is that this ensures that the total signal 
envelope in power-domain remains constant. Unlike a periodic signal, when two noise 
samples are added, the total sum can be smaller than the absolute value of any of the two 
samples. This is because the signals are (mostly) not in-phase. The sine-window adjusts for 
this effect and removes the envelope-modulation. 
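A short derivation may make this concrete. It assumes, as an illustration not stated in the text, that the fade-out window is the time-reversed fade-in window, and it reads the two windows as a raised cosine 0.5 - 0.5·cos(θ_n) and a sine sin(θ_n/2) with θ_n = π(n + 0.5)/m. For two uncorrelated noise signals of equal power the cross terms average out, so the power of the weighted sum is proportional to the sum of the squared weights.

```latex
\begin{align*}
\theta_n &= \frac{\pi (n + 0.5)}{m}, \qquad \theta_{m-1-n} = \pi - \theta_n, \qquad 0 \le n < m \\[4pt]
\text{sine:}\quad w_{\mathrm{in}}[n] &= \sin\frac{\theta_n}{2}, \qquad
w_{\mathrm{out}}[n] = \sin\frac{\pi - \theta_n}{2} = \cos\frac{\theta_n}{2} \\
w_{\mathrm{in}}^2[n] + w_{\mathrm{out}}^2[n] &= \sin^2\frac{\theta_n}{2} + \cos^2\frac{\theta_n}{2} = 1 \\[4pt]
\text{raised cosine:}\quad w_{\mathrm{in}}[n] &= \tfrac12 - \tfrac12\cos\theta_n, \qquad
w_{\mathrm{out}}[n] = \tfrac12 + \tfrac12\cos\theta_n \\
w_{\mathrm{in}}[n] + w_{\mathrm{out}}[n] &= 1, \qquad
w_{\mathrm{in}}^2[n] + w_{\mathrm{out}}^2[n] = \tfrac12\left(1 + \cos^2\theta_n\right) \le 1
\end{align*}
```

Under these assumptions the sine window keeps the summed power constant across the smoothing range, whereas the raised cosine keeps the summed amplitude constant (suited to in-phase, voiced periods) but lets the power of uncorrelated signals dip towards one half in the middle of the range.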

Fig. 2 illustrates the process of appending interval periods in inverted order
(cf. steps 102 and 108 of figure 1). Time axis 200 illustrates the time domain of diphone
signal A. The diphone signal A has an end interval 202 which contains periods p_1, p_2, ..., p_i,
..., p_{n-1}, p_n. In order to provide fade-out interval 204, periods p_i of the end interval 202 are
appended at the end of the end interval 202 in inverted order. The last period p_n of the end
interval 202 is not appended in order to avoid a repetition of two identical periods which
would introduce an unintended periodicity. Such a periodicity could become audible under
certain circumstances. It is therefore preferred not to repeat the last period p_n of the end
interval 202. The first period p'_1 of the fade-out interval 204 is provided by copying the
signal of period p_{n-1}. In general, period p'_j of fade-out interval 204 is obtained by appending
period p_{n-j} from the end interval 202, i.e. p'_j = p_{n-j}.

Time axis 206 is illustrative of the time
domain of diphone signal B. Diphone signal B has a front interval 208 containing periods P_1,
P_2, ..., P_i, ..., P_{n-1}, P_n. Fade-in interval 210 is provided by appending periods from front
interval 208 at the beginning of front interval 208 in inverted order. Again it is preferred not
to append the first period P_1 of the front interval 208 to avoid the introduction of unintended
periodicity. In the general case a signal period P'_j is obtained from the period P_{n-j+1} of the
front interval 208, i.e. P'_j = P_{n-j+1}.

For concatenating the diphone signal A and the diphone
signal B, the end interval 202 and the fade-in interval 210 are overlapped and added as well
as the fade-out interval 204 and front interval 208. In the example considered here this can be
done without adapting the durations of the respective intervals, as the durations of the end
interval 202 and the fade-in interval 210 as well as the durations of the fade-out interval 204
and the front interval 208 are the same.
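The index rules for the appended periods, read here as p'_j = p_{n-j} for the fade-out interval and P'_j = P_{n-j+1} for the fade-in interval, can be sketched with list slicing. Each period is treated as an opaque item (in practice it would be a list of samples); the function names are hypothetical.

```python
def append_fade_out(end_periods):
    """Return the end interval followed by its fade-out interval.

    The fade-out interval repeats the end interval in inverted order,
    skipping the last period p_n so that no period occurs twice in a row.
    """
    fade_out = end_periods[-2::-1]      # p_{n-1}, p_{n-2}, ..., p_1
    return end_periods + fade_out


def prepend_fade_in(front_periods):
    """Return the fade-in interval followed by the front interval.

    The fade-in interval repeats the front interval in inverted order,
    skipping the first period P_1 so that no period occurs twice in a row.
    """
    fade_in = front_periods[:0:-1]      # P_n, P_{n-1}, ..., P_2
    return fade_in + front_periods


unit_a_end = ["p1", "p2", "p3", "p4"]
unit_b_front = ["P1", "P2", "P3", "P4"]
extended_a = append_fade_out(unit_a_end)
extended_b = prepend_fade_in(unit_b_front)
```

With four periods per interval, each extended sequence holds seven periods, and the period at the turning point (p4 or P1) appears only once.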

Fig. 3 shows an example for the various synthesis steps for the word 'young'.
This word is made of the phonemes /j/, /V/, /N/ and the silence /_/. a) and b) are the recorded
nonsense words that contain the transitions from /j/ to /V/ and /V/ to /N/. Within each
nonsense word five markers are placed. The outer markers are the diphone borders (labels j-,
-V, V- and -N). The markers in the middle show where a new phoneme starts (labels V and
N). The other labels are used to mark the segments that will be used for overlap-add. As it is
illustrated in the diagram (c) of figure 3 the periods of the end interval 300 are repeated in
inverted order to provide a fade-out interval 302. All the periods within end interval 300 are
appended after period 304 which is the last period of the end interval 300. Period 304 itself is
not appended to avoid the repetition of the same period which would introduce an unintended
periodicity. Likewise for the diphone signal of diagram (b) of figure 3 the periods within
front interval 306 are appended at the beginning of the front interval 306 in inverted order.
This applies for all of the periods within the front interval 306 except the first period 310 at
the beginning of the front interval 306. Again this period 310 is not appended in order to
avoid two consecutive identical periods which would introduce an unintended periodicity.
The same kind of processing is done for the front interval 312 of the diphone signal of the
diagram (a) and for the end interval 314 of the diphone signal of diagram (b). Further the
same approach is applied to the further diphones which are required to be concatenated for
the synthesis of the word 'young'. Next a smoothening window is applied to the front, end,
fade-in and fade-out intervals. For voiced segments a raised cosine is preferably used as a
window function. The following window function is employed for the fade-in and front
intervals:

$$w[n] = 0.5 - 0.5 \cos\left(\frac{\pi (n + 0.5)}{m}\right), \qquad 0 \le n < m$$

where m is the total number of periods in the smoothening range. The
corresponding raised cosine is shown as raised cosine 316 in diagram (d). A corresponding
window function is used to provide raised cosine 318 for the end and fade-out intervals 300
and 302. As it is illustrated in the diagram (e) the durations of the intervals to be overlapped
and added, i.e. intervals 300/308 and intervals 302/306 are rescaled in order to bring them to
an equal length. The following superposition of the required diphones provides the synthesis
of the word 'young'.

Fig. 4 shows a block diagram of computer system 400, which is a text-to-speech
system. The computer system 400 has module 402 which serves to store diphones and
markers for the diphones to indicate front and end intervals. Module 404 serves to repeat
periods contained in the end and front intervals in inverted order in order to provide fade-in
and fade-out intervals. Module 406 serves to provide a window function for windowing the
end/fade-out and fade-in/front intervals for the purposes of smoothening. Module 408 serves
for duration adaptation of the intervals to be superposed. Such a duration adaptation is
required if the intervals to be superposed are not of equal length. Module 410 serves for the
superposition of the end/fade-in and of the fade-out/front intervals in order to concatenate
the required diphones. When text is entered into the computer system 400 the required
diphones to be concatenated are selected from module 402. These diphones are processed by
means of modules 404, 406 and 408 before they are overlapped and added by means of
module 410, which results in the required synthesized speech signal.
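The processing chain of modules 404 to 410 can be sketched end to end as follows. This is a minimal illustration under several assumptions that are not taken from the text: each speech unit is a list of periods (each period a list of samples), the window weights are applied per period, the fade-out window is the time-reversed fade-in window, and duration adaptation rescales the whole fade-in/front region at once by linear interpolation rather than adapting the two interval pairs separately. All function and parameter names are hypothetical.

```python
import math


def fade_in_weights(m, voiced=True):
    """Per-period fade-in weights: raised cosine (voiced) or sine (unvoiced)."""
    if voiced:
        return [0.5 - 0.5 * math.cos(math.pi * (n + 0.5) / m) for n in range(m)]
    return [math.sin(math.pi * (n + 0.5) / (2 * m)) for n in range(m)]


def weighted_samples(periods, weights):
    """Scale every sample of period k by weights[k] and flatten to one list."""
    return [w * s for period, w in zip(periods, weights) for s in period]


def resample(samples, length):
    """Linear-interpolation rescaling (an assumed form of duration adaptation)."""
    if length == len(samples):
        return list(samples)
    if len(samples) < 2 or length < 2:
        return [samples[0]] * length if samples else []
    step = (len(samples) - 1) / (length - 1)
    out = []
    for i in range(length):
        lo = min(int(i * step), len(samples) - 2)
        frac = i * step - lo
        out.append((1 - frac) * samples[lo] + frac * samples[lo + 1])
    return out


def concatenate_units(unit_a, n_end, unit_b, n_front, voiced=True):
    """Join two units; n_end and n_front (both >= 1) count the periods in
    the end interval of unit_a and the front interval of unit_b."""
    end, front = unit_a[-n_end:], unit_b[:n_front]
    # Module 404: append periods in inverted order, skipping the period
    # adjacent to the joint so that it is not immediately repeated.
    tail = end + end[-2::-1]        # end interval + fade-out interval
    head = front[:0:-1] + front     # fade-in interval + front interval
    # Module 406: fade-out window on unit A, fade-in window on unit B.
    a_part = weighted_samples(tail, fade_in_weights(len(tail), voiced)[::-1])
    b_part = weighted_samples(head, fade_in_weights(len(head), voiced))
    # Module 408: bring both regions to the same duration.
    b_part = resample(b_part, len(a_part))
    # Module 410: overlap and add.
    joint = [x + y for x, y in zip(a_part, b_part)]
    body_a = [s for period in unit_a[:-n_end] for s in period]
    body_b = [s for period in unit_b[n_front:] for s in period]
    return body_a + joint + body_b
```

As a sanity check on the windowing, joining two units made of identical constant periods with `voiced=True` should return a constant signal, because the per-period fade-out and fade-in weights sum to one.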
1. A method of synthesizing of a speech signal, the speech signal having at least
a first speech unit and a second speech unit, the method comprising the steps of:

- providing a first speech unit signal, the first speech unit signal having an end
interval,

- providing a second speech unit signal, the second speech unit signal having a
front interval,

- appending of at least some of the periods of the end interval in inverted order
at the end of the first speech unit signal to provide a fade-out interval,

- appending of at least some of the periods of the front interval in inverted order
at the beginning of the second speech unit signal to provide a fade-in interval,

- superposing of the end and fade-in intervals and of the fade-out and front
intervals.



2. The method of claim 1, whereby the end and front intervals have
approximately steady periods.

3. The method of claim 1 or 2, the end and front intervals being identified by a
marker.

4. The method of claim 1, 2 or 3, whereby the last period of the end interval and
the first period of the front interval are not appended.

5. The method of any one of the preceding claims 1 to 4, further comprising
windowing of the end and/or fade-out intervals with a fade-out window.

6. The method of claim 5, whereby a raised cosine is used as a fade-out window.

7. The method of claim 6, whereby the following window function is used for
voiced intervals:

$$w[n] = 0.5 - 0.5 \cos\left(\frac{\pi (n + 0.5)}{m}\right), \qquad 0 \le n < m$$

where m is the total number of periods in a smoothening range.

8. The method of claim 5, whereby a sine window is used as a fade-out window
for unvoiced intervals.

9. The method of claim 8, whereby the following window function is used:

$$w[n] = \sin\left(\frac{\pi (n + 0.5)}{2m}\right), \qquad 0 \le n < m$$

where m is the total number of periods in a smoothening range.

10. The method of any one of the preceding claims 1 to 9, the first and second
speech units being diphones and/or triphones and/or polyphones, in particular words.

11. The method of any one of the preceding claims 1 to 10, further comprising
adapting the durations of the end and fade-in intervals and of the fade-out and front intervals.

12. The method of any one of the preceding claims 1 to 11, whereby the speech
signal is synthesized by means of an overlap and add operation.

13. Computer program product, in particular, digital storage medium, comprising
program means for synthesizing of a speech signal, the speech signal having at least a first
speech unit and a second speech unit, the program means being adapted to perform the steps
of:

- providing a first speech unit signal, the first speech unit signal having an end
interval,

- providing a second speech unit signal, the second speech unit signal having a
front interval,

- appending of at least some of the periods of the end interval in inverted order
at the end of the first speech unit signal to provide a fade-out interval,

- appending of at least some of the periods of the front interval in inverted order
at the beginning of the second speech unit signal to provide a fade-in interval,

- superposing of the end and fade-in intervals and of the fade-out and front
intervals.




14. Computer system, in particular text-to-speech system, for synthesizing of a
speech signal, the speech signal having at least a first speech unit and a second speech unit,
the computer system comprising:

- means for storing of a first speech unit signal, the first speech unit signal
having an end interval, and for storing of a second speech unit signal, the second speech unit
signal having a front interval,

- means for appending of at least some of the periods of the end interval in
inverted order at the end of the first speech unit signal to provide a fade-out interval,

- means for appending of at least some of the periods of the front interval in
inverted order at the beginning of the second speech unit signal to provide a fade-in interval,

- means for superposing of the end and fade-in intervals and of the fade-out and
front intervals.